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Abstract 

Population genetic studies have found evidence for dramatic population growth in recent human history. It is unclear how 
this recent population growth, combined with the effects of negative natural selection, has affected patterns of deleterious 
variation, as well as the number, frequency, and effect sizes of mutations that contribute risk to complex traits. Because 
researchers are performing exome sequencing studies aimed at uncovering the role of low-frequency variants in the risk of 
complex traits, this topic is of critical importance. Here I use simulations under population genetic models where a 
proportion of the heritability of the trait is accounted for by mutations in a subset of the exome. I show that recent 
population growth increases the proportion of nonsynonymous variants segregating in the population, but does not affect 
the genetic load relative to a population that did not expand. Under a model where a mutation's effect on a trait is 
correlated with its effect on fitness, rare variants explain a greater portion of the additive genetic variance of the trait in a 
population that has recently expanded than in a population that did not recently expand. Further, when using a single- 
marker test, for a given false-positive rate and sample size, recent population growth decreases the expected number of 
significant associations with the trait relative to the number detected in a population that did not expand. However, in a 
model where there is no correlation between a mutation's effect on fitness and the effect on the trait, common variants 
account for much of the additive genetic variance, regardless of demography. Moreover, here demography does not affect 
the number of significant associations detected. These findings suggest recent population history may be an important 
factor influencing the power of association tests and in accounting for the missing heritability of certain complex traits. 
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Introduction 

Genome-wide association studies (GWAS) have successfully 
detected associations between hundreds of common single 
nucleotide polymorphisms (SNPs) and complex traits in humans 
[1,2]. While this catalog of genes has revealed important biological 
insights, for most traits the discovered associations can account 
only for a small fraction of the heritability measured from family- 
based studies [3]. This difference in the heritability observed in 
familial studies and the heritability explained by associated SNPs 
has been termed "missing heritability," and there is tremendous 
interest in the human genetics community to find it [3-5]. 

One possibility that has received particular attention is that the 
missing heritability lies in rare variants that have large effect sizes 
[3] . Because a risk variant is rare in the population, an association 
between the variant and the phenotypes of interest may not have 
been detected using traditional GWAS. Instead, at present, such 
variants must be assayed through direct sequencing. Due to 
technological advances (e.g. next-generation sequencing) [6,7], 
combined with newer analytical methods designed for analyzing 
full sequence data [8-11], exome and full genome-sequencing 
studies are now being implemented in human genetics. The 
progression to sequencing data has already proved fruitful for the 
identification of causal mutations for several Mendelian diseases 
[6,12-16]. Full sequence data [17,18] is starting to reveal a richer 



picture of low-frequency genetic variation (minor allele frequency 
<0.5%), which may, in turn, increase the community's ability to 
implicate rare variants in risk of complex disease [4,19-25]. 
Further, such studies should allow researchers to empirically 
determine the extent to which rare variants account for the 
missing heritability of complex traits [26-30]. However, before 
these new technological and methodological advances can reach 
their full potential, a more thorough understanding of low- 
frequency genetic variation in multiple human populations is 
essential. 

To learn about patterns of rare genetic variation, several studies 
have sequenced hundreds of genes or complete exomes in 
thousands of individuals [31-34]. These studies have made two 
important discoveries. First, they have found a larger number of 
rare variants than was expected under previous models of human 
population history. It has been argued that this excess of rare 
variants can be explained by the recent explosion in human 
population size [31,32,35-37]. Second, these studies have 
documented a plethora of rare, nonsynonymous SNPs that are 
likely evolutionarily deleterious and may be of medical relevance. 

Comparatively less work has been done, however, to examine 
the implications that recent population history has had on the 
architecture of complex traits (but see the recent paper by Simons 
et al. [38]). It is unclear whether population history, and recent 
population growth in particular, affects the number, frequency, 
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Author Summary 

Many human populations have dramatically expanded 
over the last several thousand years. I use population 
genetic models to investigate how recent population 
expansions affect patterns of mutations that reduce 
reproductive fitness and contribute to the genetic basis 
of complex traits (including common disease). I show that 
recent population growth increases the proportion of 
mutations found in the population that reduce fitness. 
When mutations that have the greatest effect on repro- 
ductive fitness also have the greatest effect on a complex 
trait, more of the heritability of the trait is due to 
mutations at very low-frequency in populations that have 
recently expanded, as compared to populations that have 
not. Also, under this model, for a given sample size and 
false-positive rate, fewer variants show statistically signif- 
icant associations with the trait in the population that has 
expanded than in one that has not. Both of these findings 
suggest that recent population growth may make it more 
difficult to fully elucidate the genetic basis of complex 
traits that are directly or indirectly correlated with 
reproductive fitness. 



and effect sizes of mutations that contribute risk to complex traits. 
Addressing this question is critical for finding the "missing 
heritability" in different populations, as well as performing the 
most powerful association studies to implicate specific variants in 
disease risk. Over a decade ago, it was recognized that the power 
to associate common variants with complex disease varied across 
populations [39-41]. This was largely due to asymmetry in the 
extent of linkage disequilibrium (LD) across populations as a result 
of differences in demographic history [42-44]. While the issue of 
LD is less relevant when considering rare variants, the topic of 
population choice for association studies has received substantially 
less attention when considering rare variants, despite its potential 
importance. 

Here I use population genetic models to investigate the effect of 
recent population growth on patterns of deleterious genetic 
variation, the architecture of complex traits, and the ability to 
associate causal variants with the trait in models where a 
proportion of the trait's heritability is accounted for by mutations 
in a subset of the exome. Specifically, I show that recent 
population growth increases the input of deleterious mutations 
into the population, directly causing a proportional excess of 
deleterious genetic variation segregating in the population. 
Second, if a mutation's effect on reproductive fitness is correlated 
with its effect on a complex trait (such as a disease), I show that 
recent population growth increases the amount of the additive 
genetic variance of the trait that is accounted for by low-frequency 
variants relative to that in a population that did not expand 
recently. Further, I demonstrate that recent population growth 
leads to an increase in the number of alleles that contribute to the 
trait relative to what is expected in a population that did not 
recently expand. Finally, recent population growth decreases the 
number of SNPs that are significantly associated with the trait, 
relative to the number detected in a population that did not 
recently expand. This work indicates that in certain circumstances, 
recent population history will play an important role in 
determining the genetic architecture of complex traits in a 
particular population under study. As such, recent population 
history is a factor that should be considered when designing and 
interpreting re-sequencing studies for complex traits. 



Methods 

Models of population history 

I explore several of models of population history (Fig. 1). 
Because many studies have inferred a population bottleneck in 
non-African human populations associated with the Out-of-Africa 
migration process [45-50], the first model includes a brief, but 
severe, reduction in population size (Fig. 1A). After the bottieneck, 
the population returns to the same size as the ancestral population. 
This model is referred to as "BN" throughout the rest of paper. 
The second model of population history also includes the same 
Out-of-Africa population bottleneck, but now includes an instan- 
taneous, 100-fold population expansion in the last 80 generations, 
or the last 2000 years, assuming 25 years/ generation (Fig. IB) 
[51]. This recent explosion in effective population size is meant to 
approximate the expansion detected in the archeological and 
historical records as well as in studies of genetic variation 
[31,35,36,52]. This model is referred to as "BN+growth" 
throughout the paper. Finally, for comparison purposes, I also 
investigate a model where a population experienced an ancient 2- 
fold expansion (Fig. 1C). Such a model is meant to reflect the 
history of African populations [46,47,53] and is referred to as "Old 
growth" in the paper. 

Forward simulations 

All results were obtained using the forward-in-time population 
genetic program described in Lohmueller et al. [54], with minor 
modifications. Briefly, the program assumes a Wright-Fisher 
model of population history. Each generation, alleles change 
frequency stochastically based on binomial sampling and deter- 
ministically based on the standard selection equations. Also in each 
generation, a Poisson distributed number of new mutations enter 
the population at rate 0/2, where Q = ANiji. Here A^,- is the 
population size in the 2th epoch of population history (see below) 
and )i is the per-chromosome per generation mutation rate across 
all the coding sequence that was simulated. I set ^( = 0.056 for 
synonymous sites. For nonsynonymous sites, j.1 is 2.5 times higher, 
because of the larger number of sites that, when mutated, give rise 
to a nonsynonymous mutation. Selection coefficients for new 
mutations are drawn from a gamma distribution with the 
parameters as inferred in Boyko et al. [46]. As done in the 
Poisson Random Field framework, all mutations are assumed to be 
independent of each other [55]. 

Step-wise population size changes are included in the model by 
changing the population size (jV) at particular time points. Size 
changes affect the number of mutations that enter the population 
during each individual epoch and the magnitude of genetic drift. 

For computational efficiency, I divided the population size by 2 
and rescaled all times to be two-fold smaller than under the 
specified model. However, I keep the population scaled mutation 
rate (9), and the population scaled selection coefficient (y = 2Ns), 
equal to the same values as for the larger population. This 
rescaling is possible because, in the diffusion limit, patterns of 
genetic variation only depend on the scaled parameters. Such a 
rescaling is customary in other forward simulation programs 
[56,57]. Samples of 1000 chromosomes are taken from the 
population at different time points to calculate how diversity 
statistics change over time. 

Models of complex traits 

To evaluate the effect of recent demographic history on the 
architecture of complex traits, I simulate individuals who have a 
quantitative trait. I assume that deleterious (nonsynonymous) 
mutations in a given mutational target account for some of the 
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Figure 1. Models of population size changes over time. (A) A model of European population history with a severe bottleneck starting 2000 
generations ago (BN). (B) A similar model of European population history as shown in (A), except that here the population instantaneously expanded 
100-fold 80 generations ago (BN+growth). (C) A model with a 2-fold ancient expansion (Old growth). This is a possible model for African population 
history. 

doi:10.1371/journal.pgen.1004379.g001 



heritability of the trait. Such a model is implicitly assumed in 
exome re-sequencing studies used to implicate rare variants in 
disease risk. This quantitative trait could represent a trait that is 
measured on a quantitative scale (e.g. lipid levels) or represent the 
underlying risk to a dichotomous phenotype (e.g. diabetes). Below 
I provide a description of the model and parameters. 

I investigate models where mutations at a subset of nonsynon- 
ymous sites can account for 5%, 10% or, 30% of the variance of 
the phenotype (i.e. the heritability accounted for by these variants 
is 5%, 10% or 30%). Thus, the models considered here assume 
that some fraction of the total heritability of the trait is accounted 
for by variants within the mutational target (i.e. a portion of the 
exome) while the rest is accounted for by variants not modeled 
here (i.e. noncoding portions of the genome). The mutational 
target size, M, is the number of nonsynonymous sites in the 
genome that, if mutated, would generate a variant that affects the 
phenotype. Assuming a mutation rate of 1x10" per site per 
generation, I investigate mutational target sizes of 70 kb and 
140 kb. To gain a sense of how these sites could be partitioned into 
genes, the median length of coding regions of human genes is 
1335 bp [58]. Thus, a random gene would have approximately 
934 nonsynonymous sites (assuming 70% of the coding sites are 
nonsynonymous). If all nonsynonymous sites within the gene 
would, if mutated, produce a causal variant, then the mutational 
target size of 70 kb would correspond to 75 distinct genes 
accounting for the specified heritability, and the target size of 
140 kb would correspond to 150 distinct causal genes accounting 
for the heritability. If only half the nonsynonymous sites could be 
mutated to causal variants, then the number of genes would 
increase by a factor of 2. In practice, this model is implemented by 
taking a subset of the nonsynonymous SNPs simulated as 
described above and then assigning them effects on the trait. 

To assign an effect on the trait to a given causal SNP, I follow 
the model described by Eyre-Walker [59], with modifications 
described below. Essentially, the z' th SNP's effect on a trait, Oli, is 
given by 

a.j = 5sj Z (\ + 8j)C, 

where 8=1, ,f, is the selective disadvantage for the f h SNP, x is the 
relationship between the SNP's effect on fitness and the trait. A 
value of x = 1.0 indicates a linear relationship, where the mutations 
that are most deleterious will also have the biggest effects on the 



trait. A value of x — 0.0 indicates that a mutation's effect on fitness 
is independent of its effect on the trait. I set x = 0.5 and x = 0.0, to 
model a situation where there is a relationship between fitness and 
the trait, and another situation where the trait is independent of 
fitness. Next, £, for the t h SNP is drawn from a normal distribution 
with mean 0 and a standard deviation of 0.5. I did not vary this 
standard deviation because Eyre-Walker showed that varying this 
parameter had little effect on the overall results [59]. C is a 
normalizing constant for the SNP effect sizes so that 

V A = Y, 2pi(l-Pi)°t*h 2 c , 

i SNPs 

where h 2 c e{Q. 05, 0.1,0.3}. Essentially, C is a scaling constant for 
the SNP effect sizes so that the desired heritability is achieved 
under each combination of parameters h 2 c , x, and M. Importantly, 
I find the average value of C across all simulation replicates in the 
standard neutral model, and then I use this value of C for 
simulations under the other demographic models. As such, a SNP 
with a given effect on the trait (ot;) under one demographic 
scenario will have the same effect on the trait under a different 
demographic scenario. This framework has the desirable property 
that a SNP's effect on a trait in a particular individual is 
biologically determined and is not directly affected by the 
demography of the population. Additionally, when setting up the 
simulations in this manner, the actual h 2 in a given simulation 
replicate is the outcome of a stochastic process, rather than set to a 
specific value. Nevertheless, in practice, there was little variation in 
h 2 across different demographic scenarios (Fig. SI). Incidentally, 
different values of C are found when using different values of /z^, x, 
and M (Table SI). This is reasonable because these models are 
biologically very different from each other. 

I then assign trait values (TJ) to each simulated individual. This is 
done using an additive model, 

Yj= ZijCX.j+Ej, 
i SNPs 

where the summation is over all i causal variants, Zij is the number 
of copies of the risk allele (zye{0,l,2}) carried at the z* SNP by 
the j h individual, ot; is the effect of the z' th SNP, and 8j is the 
environmental variance, which is drawn from a normal distribu- 
tion with mean 0 and variance \—h 2 c (see [60-62]). For some 
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analyses, I translate these quantitative traits into dichotomous 
diseases. To do this, I assume the liability threshold model for 
complex traits, where there is an underlying continuous distribu- 
tion of risk in the population, and where cases are those individuals 
whose risk (Yj) falls above a discrete threshold (L) [60,63,64]. L, or 
the liability threshold, is set in each simulation replicate by 
transforming the phenotypes (Tjs) to follow the standard normal 
distribution and then picking the threshold such that 40% of the 
individuals have liabilities greater than L. In this model, the disease 
has a prevalence of 40%. 1000 case individuals are randomly 
sampled from the individuals in this upper tail of the distribution. 
1000 controls are selected from the lower 60% of the distribution. 
Single-marker association tests are then performed for each SNP 
using Fisher's exact test. A test is considered significant if its P- 
value is <1 xl0~', unless otherwise stated. 

Results 

Recent growth and deleterious variation 

First I assess the effect that different population histories (BN, 
BN+growth, and Old Growth) have on neutral and deleterious 
genetic variation. Fig. 2A and Fig. 2B show how the number of 
synonymous and nonsynonymous SNPs, respectively, segregating 
in a sample of 1000 chromosomes change over time as the 
simulated populations change in size. The population bottleneck 
2000 generations ago resulted in a decrease in the number of SNPs 
segregating in the BN and BN+growth populations (orange and 
green lines in Fig. 2 A and Fig. 2B). When the populations 
recovered from the bottleneck and increased in size, the number of 
SNPs in the population also increased. This increase in the 
number of SNPs after the recovery from the bottieneck is due to 
two factors. First, the larger population size allows more new 
mutations to enter the population. Second, genetic drift has a 
weaker effect when the population size is large. As such, more 
SNPs are maintained in the population. The recent explosion in 
population size (dashed green lines in Fig. 2A and Fig. 2B; BN+ 
growth) rapidly results in a substantial increase in the number of 
both synonymous and nonsynonymous SNPs segregating in the 
population. This is due to the extreme increase in the population 
mutation rate (typically referred to as 0) due to the larger 
population size. Ancient population growth also resulted in an 
increase in the number of synonymous and nonsynonymous SNPs 
segregating in the population, via the same mechanisms (purple 
line in Fig. 2A and Fig 2B; Old growth). 

Fig. 2C shows how the proportion of nonsynonymous SNPs 
segregating in the population changes over time. When popula- 
tions BN and BN+growth decreased in size during the botdeneck, 
the proportion of nonsynonymous SNPs in the population also 
dropped (orange and green lines in Fig. 2C). The reason for this is 
that, when the population size decreases, rare variants are 
preferentially lost over common variants. More nonsynonymous 
than synonymous SNPs are rare, and, as such, the crash in 
population size results in the loss of more nonsynonymous SNPs 
than synonymous SNPs. After the population recovers from the 
bottleneck, the proportion of nonsynonymous SNPs found in the 
population increases (Fig. 2C). The reason for this increase is that, 
due to the increase in population size, many new mutations enter 
the population after the recovery from the bottleneck. Most of 
these new mutations are nonsynonymous, due to there being more 
possible nonsynonymous changes than synonymous changes in 
coding regions. In fact, the proportion of nonsynonymous SNPs 
segregating in the population immediately after the recovery of the 
bottleneck is actually higher than that in the ancestral population 
(Fig. S2). The very recent increase in size in the BN+growth 



population also results in an increase in the proportion of 
nonsynonymous SNPs (green line in Fig. 2C). 54.8% of the 
SNPs in BN+growth are nonsynonymous (green line in Fig. 2C), 
compared to 52.8% in the BN population (orange line in Fig. 2C). 
It will take approximately 4M e (where N e is the current effective 
population size) generations for the proportion of deleterious SNPs 
to reach the equilibrium value for the larger population size (Text 
SI). The population that underwent an ancient expansion (dotted 
purple line in Fig. 2C) also experienced an initial increase in the 
proportion of nonsynonymous SNPs segregating immediately after 
the expansion. However, because the expansion occurred long 
ago, selection has had sufficient time to remove many of the 
nonsynonymous SNPs and bring the proportion of nonsynon- 
ymous SNPs in the population below the value seen in the 
bottlenecked populations, consistent with previous simulations and 
empirical observations [54]. These results suggest that recent, 
extreme changes in demographic history can have an impact on 
patterns of deleterious mutations that are segregating in the 
population. This pattern also holds with other magnitudes of 
population growth (Text SI). 

Next I examine the average fitness effect of nonsynonymous 
SNPs segregating in a sample of 1000 chromosomes at different 
time points in the simulations (Fig. 2D). During the botdeneck, 
the average segregating SNP in the BN and BN+growth 
populations is less deleterious than in the ancestral population 
(orange and green lines in Fig. 2D). This is due to many rare, 
deleterious SNPs being eliminated from the population as well as 
fewer new deleterious SNPs entering the population when it is 
small in size. After the population recovers from the botdeneck, 
the average segregating SNP became more deleterious. In the first 
few generations after the recovery, the average SNP is even more 
deleterious than in the ancestral population. This is due to the 
increase in the input of deleterious mutations immediately after the 
recovery from the bottleneck. After a few generations however, 
negative natural selection has eliminated many, though certainly 
not all, of these deleterious SNPs from the population. In fact, 
Fig. 2D shows that even in the present day, the average SNP is 
more deleterious than that in the ancestral population. This same 
effect applies even more strongly to the recent population growth 
within the last 80 generations. Immediately, after growth, the 
average SNP segregating in the BN+growth population was more 
strongly deleterious (Fig. 2D) than what is expected in a 
population that has not expanded. However, during the last 40 
generations, selection has eliminated many of the most deleterious 
SNPs from the population. In the present day, the average SNP in 
the BN+growth population (green line) is slightly less deleterious 
than in the BN population (orange line, also see Text SI), 
consistent with the results of Gazave et al. [65]. This effect is less 
pronounced with decreasing amounts of population growth (Text 

si). 

The description of the average strength of selection on a SNP 
described above does not take into account the frequency of the 
deleterious SNP in the population. The genetic load, however, 
does by weighting the selection coefficient by the SNP's frequency 
[66]. Genetic load is the reduction in mean fitness of the 
population due to deleterious mutations [67]. I find that, unlike 
the average selection coefficient, the genetic load is not affected by 
the demographic history of the population (Fig. S3). Similar 
results have recently been reported by Simons et al. [38]. Thus, 
while the recent population growth increases the number of 
deleterious SNPs segregating in the population, this increase in 
load is offset by the fact that most of these new deleterious 
mutations are kept at very low frequency in the population. Put 
another way, while the load appears to be the same across 
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Figure 2. Changes in genetic variation over time as a function of population size. Solid orange lines denote the bottlenecked population 
that did not recently expand (BN). Dashed green lines denote a population that expanded 80 generations ago (BN+growth). Note that the lines from 
the two populations overlap except in the last 80 generations. Dashed purple lines denote the population that underwent an ancient expansion (Old 
growth). (A) Number of synonymous SNPs segregating in the sample. (B) Number of nonsynonymous SNPs segregating in the sample. (C) Proportion 
of SNPs segregating in the sample that are nonsynonymous. (D) Absolute value of the average fitness effect of nonsynonymous SNPs segregating in 
the sample. Samples of 1000 chromosomes were taken at different time points throughout the simulation. Results are averaged over 1000 simulation 
replicates. 

doi:1 0.1 371 /journal.pgen.1 004379.g002 



demographic models, the way in which the populations arrive at 
that load differs across demographies. The BN+growth population 
contains many rare deleterious mutations. The BN population 
contains fewer deleterious mutations, but those that are there tend 
to be at higher frequencies. 

Models of a complex trait that are compatible with 
empirical results 

One important question is to what extent recent population 
growth affects the architecture of complex traits and our ability to 
map the genes responsible for disease risk. To investigate this issue, 
I simulate quantitative phenotypes for individuals sampled from 
the simulations under the three different demographic models. I 
investigate two different models for the relationship between a 
mutation's effect on fitness (the selection coefficient), and its effect 
on the trait [59] . First, I assume that a mutation's effect on fitness 
is related to its effect on the trait (t = 0.5). Here, those mutations 



that are strongly deleterious have a greater effect on the trait. 
Second, I investigate a model where a SNP's effect on fitness is 
independent of its effect on the trait (x = 0). I also investigate 
different mutational target sizes (M= 70 kb and M— 140 kb) and 
the amount of the heritability accounted for by variants occurring 
within the mutational target {h 2 c e{0. 3, 0.1,0.05}; see Methods; see 
Discussion for further justification of these models). 

Because the model parameters were chosen to reflect what 
might be observed in exome resequencing data, I test the validity 
of these parameter values by comparing the expected number of 
single SNPs associated with a trait to what has been observed in 
exome sequencing studies. A recent exome sequencing study for 
type 2 diabetes with 1000 cases and 1000 controls (the same 
sample size simulated here) reported zero single SNP associations 
at a lxlO - ' significance level [30]. This result is broadly 
consistent with models where T = 0.5 and h 2 c = 0. 1 or h 2 c = 0.05 and 
and models with T = 0 and /z^. = 0.05 (Fig. S4). Next, an exome 
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sequencing study for schizophrenia with 2,536 cases and 2,543 
controls [68] found only common variants with P<10 5 , which, 
while not directly comparable to the sample size used in the 
present study, is again consistent with models where x = 0.5 and 
he = 0.1 or h 2 c = 0.05 and models with x = 0 and h 2 c = 0.05 (Fig. 
S4). Finally, an exome sequencing study for lipids included 2,005 
individuals and found two variants with P<10~ 5 in a single- 
marker analysis [69] . These results are compatible with a variety 
of models simulated here (Fig. S4), though Lange et al.'s 
quantitative association test should have more power than the 
dichotomous one used here, making precise comparisons more 
difficult. 

Due to the limited number of exome sequencing studies 
available, I also assess whether my models generate simulated 
datasets that are compatible with observations from GWAS 
studies. For simplicity, I make the assumption that the causal 
variants themselves have been directly assayed or have been 
imputed through LD with tag SNPs included in the GWAS. This 
of course is unlikely to be true in practice, particularly for rare 
variants [17,18,70]. Nevertheless, this comparison still serves as a 
useful benchmark to exclude models that are obviously not 
consistent with the observed GWAS data, with the caveat that 
some models that appear inconsistent with the GWAS data may 
actually fit better if LD and ascertainment bias are properly taken 
into account. 

First, it has been consistently shown that the top SNPs identified 
through GWAS account for only a very limited amount of the 
phenotypic variance (often <10%) [3,71]. I assess the amount of 
phenotypic variance (F>) explained by the top 50 SNPs that 
account for the most variance (Table S2) within each simulation 
replicate. Models where h 2 c = 0.3 and M— 70 kb or 140 kb predict 
that the top 50 SNPs account for roughly 30% of the Vp. This 
amount appears to be too high to be compatible with most of the 
GWAS results if one accepts the premise that GWAS would have 
detected the variants that explain such a large proportion of the 
phenotypic variance. However, a model with h 2 c = 0.05 predicts 
that the top 50 SNPs will account for about 5% of the phenotypic 
variance. Such a model cannot be rejected from the currently 
available GWAS data. 

Next, GWAS suggest that most risk loci for complex traits have 
very small effect sizes and are difficult to detect in samples of only 
1000 cases and controls [71]. Table S3 shows the expected 
number of GWAS hits (P<5xl0~ ) expected in each simulation 
replicate in samples of 1000 cases and 1000 controls for the 
different models of h c , M, and x. Models where h 2 c = Q.7> and 
M= 70 kb or 140 kb predict that 1-4 significant GWAS hits 
should be observed. This is too many to be compatible with the 
observed data. Models with the mutational target accounting for 
less of the heritability (h 2 c = 0.l and A|, = 0.05) predict <1 
significant association. Thus, these models cannot be rejected 
based on GWAS data. 

In summary, it is unclear how much of the heritability of 
complex traits in humans is accounted for by the exome and what 
the appropriate mutational target size for common diseases should 
be. Thus, I consider a variety of models examining different parts 
of this parameter space. Some of these models cannot be rejected 
on the basis of existing exome sequencing and GWAS data. Other 
models may be more compatible with GWAS data if imperfect LD 
between causal variants and genotyped variants was properly 
taken into account. Overall, these models provide a framework 
consistent with existing empirical data with which to investigate 
the effect of recent population history and the genetic architecture 
of complex traits. 



Recent growth and the heritability of complex traits 

Using the models of demography, selection, and genetic 
architecture described above, I first examine the effect that 
population history has on the heritability of the trait. I find that 
population history has littie effect on the heritability of the trait 
(Fig. SI), regardless of the values of h 2 ^, t, and M used in the 
simulations. This is evidenced by the fact that in all three 
demographic scenarios investigated, the actual heritability esti- 
mated from each simulation replicate is close to h 2 c , the value set in 
a constant size population. To further investigate the effect of 
recent growth on heritability, I divide the causal variants 
segregating at the end of the simulation into three categories. 
The first category consists of those SNPs that arose either further 
back in time than, or during the population bottleneck ("Before 
bottleneck" in Fig. 3). These mutations occurred > 1 960 
generations ago. The second category consists of SNPs that arose 
after the population had recovered from the bottleneck, but 
further back in time than the recent population growth ("After 
bottleneck" in Fig. 3). These mutations arose between 1960 and 
80 generations ago. The final category consists of SNPs that arose 
within the last 80 generations ("After growth" in Fig. 3). In the 
BN+growth model, these are the mutations that arose after the 
population expansion. Fig. 3A shows that the average heritability 
accounted for by mutations that arose at these three different time 
points is similar in both the BN+growth population (green boxes), 
and the BN population (orange boxes). Interestingly, when a 
mutation's effect on fitness is correlated with its effect on the trait 
(t = 0.5), mutations that arose in the last 80 generations, as a class, 
account for the greatest amount of the heritability (Fig. 3A). 

Next, I investigate whether other features of genetic variation 
that affect the heritability (e.g. number of SNPs, mean allele 
frequency, mean effect size) are affected by recent population 
history. I find that recent growth has had littie effect on the 
number of mutations that arose prior to the population bottleneck 
and are still segregating in the sample, the mean allele frequency, 
and the effect sizes of such mutations (Fig. 3B, Fig. 3C, and 
Fig. 3D). However, there is a different pattern for mutations that 
arose after the bottleneck, but more than 80 generations ago (those 
in the "After bottleneck" category). Recent population growth 
increases the number of such mutations (roughly 2-fold) relative to 
that found in the population that did not expand (Fig. 3B). 
Further, these mutations tend to be at lower frequency in the BN+ 
growth population compared to the BN population (Fig. 3C). The 
only difference between the two models of population history on 
variants that arose during this time period is that genetic drift is 
weaker in the BN+growth population, compared to the BN 
population. Thus, fewer weakly deleterious mutations are lost from 
the BN+growth population, generating the pattern seen in 
Fig. 3B. The mutations that are not lost from the population 
tended to be at lower frequency in the larger population because 
they also are less likely to drift to higher frequency in the expanded 
population as compared to the non-expanded population. The 
mutations that arose within the last 80 generations also are 
affected by recent population history (those in the "After growth" 
category). As expected, recent population growth leads to a 
dramatic increase in the number of such SNPs (Fig. 3B). Further, 
the new mutations tend to be at lower frequency in the BN+ 
growth population than in the BN population (Fig. 3C). More 
surprisingly, these SNPs tend to have weaker effect sizes on the 
trait in the BN+growth population than in the BN population 
(Fig. 3D). This observation can be explained by selection more 
effectively removing moderately and strongly deleterious muta- 
tions from the recently expanded population than from the non- 
recently expanded populations [65] . Thus, while recent population 
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Figure 3. Effect of recent population growth on the heritability attributable to mutations of different ages when t = 0.5. Orange 
boxes denote the bottlenecked population that did not recently expand (BN). Green boxes denote a population that expanded 80 generations ago 
(BN+growth). "Before bottleneck" refers to mutations that arose more than 1960 generations ago (before or during the bottleneck). "After 
bottleneck" refers to mutations that arose after the population recovered from the bottleneck, but earlier than 80 generations ago. "After growth" 
refers to mutations that arose within the last 80 generations (after the population expanded). (A) Heritability attributed to mutations of different 
ages. Note that recent population growth does not affect the median heritability attributable to mutations of different ages. (B) Number of SNPs 
segregating in the present-day that arose during the different time intervals. (C) Mean allele frequency of SNPs that are segregating in the present- 
day that arose during the different time intervals. (D) Mean effect size of SNPs that are segregating in the present-day that arose during the different 
time intervals. Here /i^. = 0.05 and M = 70 kb. 
doi:1 0.1 371 /journal.pgen.1 004379.g003 



history affects the number of mutations, frequencies, and effect 
sizes of these mutations, it does so in such a way that the overall 
heritability of the trait appears unaffected by population history. 
Fig. S5 shows similar plots for the model where a mutation's effect 
on the trait is not correlated with its effect on fitness (t = 0). 

Recent growth increases the contribution of rare variants 
to the additive genetic variance 

While population history does not affect the overall heritability 
of the trait, it can have a profound impact on the additive genetic 
variance (VJ), and consequently, the heritability, attributable to 
low-frequency vs. common variants (Fig. 4). When a mutation's 
effect on fitness is correlated with its effect on the trait (t = 0.5), 



more than 50% of the additive genetic variance in the trait is 
attributable to SNPs with frequency <0.5% in the population 
under all demographic scenarios, consistent with previous work 
showing of the importance of low-frequency variants [59,72,73]. 
Crucially, the amount of the variance attributable to rare variants 
(<0.1%) varies greatly due to demographic history. Roughly twice 
as much of the genetic variance of the trait in the recendy 
expanded population (BN+ growth; green line in Fig. 4A) is 
accounted for by SNPs with frequency <0.05% than in the 
population that underwent the bottleneck, but did not expand 
(BN; orange line in Fig. 4A). The population that underwent 
ancient growth falls intermediate to the other two cases (Old 
growth; purple line in Fig. 4A). Similar results hold for other 



PLOS Genetics | www.plosgenetics.org 



7 



May 2014 | Volume 10 | Issue 5 | e1004379 



Deleterious Mutations, Demography, and Disease 



heritabilities and mutational targets (Fig. S6A-C). The situation is 
dramatically different if a SNP's effect on the trait is uncorrelated 
with its effect on fitness (t = 0; Fig. 4B). Here littie of the variance 
of the trait is accounted for by low-frequency variants, as seen by 
Eyre-Walker [59]. Additionally, under this model, demographic 
history does not make as substantial an impact on the amount of 
the additive genetic variance explained by SNPs at different 
frequencies, as suggested by Simons [38]. Again, similar results 
hold for other heritabilities considered and mutational target sizes 
(Fig. S6D-F). Thus, in some instances, recent population growth 
can result in a substantial increase in the amount of the genetic 
variance attributable to rare variants. 
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Figure 4. Cumulative distribution of the amount of the additive 
genetic variance of a trait (V A ; y-axis) explained by SNPs 
segregating below a given frequency in the population (x- 
axis). (A) A SNP's effect on the trait is correlated with its effect on 
fitness (t = 0.5). Note that the population that experienced recent 
growth (green; BN+growth) has a higher proportion of V A accounted for 
by low-frequency SNPs (<0.1% frequency) than the populations that 
did not recently expand (orange and purple; BN and Old growth). (B) A 
SNP's effect on the trait is independent of its effect on fitness (t = 0). 
Note that less of V A is accounted for by low-frequency variants than 
when the trait is correlated with fitness (A). Here lr c = 0.05 and 
Af = 70kb. 

doi:1 0.1 371 /journal.pgen.1 004379.g004 



Recent growth increases genetic heterogeneity of 
disease 

Population history also has a profound impact on the number of 
causal mutations in a sample of 1000 individuals who have been 
selected from the upper 40 th percentile of the distribution of the 
quantitative trait (Fig. 5). These individuals can be thought of as 
cases. Here recent growth (BN+growth) is predicted to result in a 
substantial increase in the number of causal mutations compared 
to a population that has not undergone such recent growth (BN; 
orange vs. green boxes in Fig. 5A). In fact, a sample of 1000 cases 
from the BN+growth population is predicted to have nearly twice 
as many distinct causal mutations as a sample from the BN 
population. An explanation for these patterns is that many new 
deleterious causal mutations have arisen after the population has 
expanded in size. Because they are new and rare, they are only 
found in a small number of individuals. As such, each individual 
has his/her own set of low-frequency risk mutations. When 
aggregating this number across thousands of individuals, the total 
number of causal mutations in the sample from the BN+growth 
population is higher than in the BN population. Interestingly, the 
number of distinct causal mutations is actually higher in the BN+ 
growth population than in the Old growth population (purple box 
in Fig. 5A). 
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Figure 5. The number of causal variants in a sample of 1000 
cases from each simulated population. (A) A SNP's effect on the 
trait is correlated with its effect on fitness (x = 0.5). Note that the 
population that experienced recent growth (green; BN+growth) has a 
higher number of causal variants than the population that did not 
recently expand (orange; BN). (B) A SNP's effect on the trait is 
independent of its effect on fitness (t = 0). Here h 2 c = 0.05 and 
M = 70 kb. 

doi:10.1371/journal.pgen.1004379.g005 
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A similar increase in the number of causal variants in the sample 
from the recently expanded population relative to a non-expanded 
population is seen even when a SNP's effect on the trait is 
uncorrelated with its effect on fitness (t = 0; Fig. 5B). This pattern 
is due to the fact that there is the same number of rare causal 
variants in the BN+growth population even when T = 0. However, 
when T — 0, many of these rare causal mutations have smaller 
effect sizes, and consequently, do not account for much of the 
phenotypic variance of the trait. 

To further explore this issue, I examine how much of the 
phenotypic variance (Vp) in the population can be accounted for 
by the SNPs that explain the most V P (Fig. 6). When x = 0.5, the 
top SNPs that explain most of the variance account for less of it in 
the population that recently expanded (BN+growth) than in the 
population that did not (BN). For example, when h 2 c = 0.05, the 25 
SNPs that account for the most fpwill account for 5% of the Vpin 
the BN population (orange line in Fig. 6A). In contrast, for the 
BN+growth population (green line in Fig. 6A), the 25 SNPs that 
account for the most F>will only explain <3.5% of it. Put another 
way, the top 25 SNPs that explain the most variance account for > 
90% of the Va contained within the mutational target in the BN 
population, but <70% of the Vj in the BN+growth population. 
These results suggest that, if mutational effects on disease are 
correlated with their effects on fitness, many of the additional rare 
causal variants found in a recently expanded population, may, in 
aggregate, explain a substantial proportion (say 20%) of the 
heritability of the trait. 

If a mutation's effect on fitness is independent of its effect on 
disease (t = 0), then the top SNPs that explain the most variance 
account for almost all of the Va contained within the mutational 
target (Fig. 6B). For example, in the model where /)^. = 0.05, the 
top 25 SNPs will account for nearly 5% of the Vp, regardless of the 
demographic history of the population. Put another way, here the 
top 25 SNPs account for the majority of the Va that is contained 
within the mutational target, and this pattern is not affected by the 
demography of the population. This finding supports the previous 
statement that, if a mutation's effect on fitness is independent of its 
effect on the trait, many of the extra causal mutations seen in 
Fig. 5B in the recently expanded population actually contribute 
very little to the overall Vp of the trait. Similar results are found for 
other values of h 2 c and mutational target sizes (Fig. S7). 

Effect of demography on the power of association tests 

Next, I investigate how different demographic histories affect 
the power to associate SNPs with a trait in a sample of 1 000 cases 
and 1000 controls. Most power simulations for association tests 
examine the power to detect a given causal variant conditional on 
its allele frequency and/ or effect size. Using this approach, I find 
that power to detect the SNPs that explain the greatest amount of 
Va is actually higher in the population that recently expanded 
(BN+growth) than in the population that only underwent a 
bottleneck (BN; Text S2). However, recent population growth has 
a more limited effect on the power to detect a given association 
when conditioning on the allele frequency or odds ratio of the 
causal SNP (Text S2). 

The power analyses summarized above refer to the power to 
detect a given causal variant, conditional on various attributes of it. 
However, the number of causal variants, their frequencies, and 
their effect sizes are random variables that are influenced by the 
evolutionary process experienced by the population under study. 
Thus, it is also useful to examine the expected number of causal 
SNPs with P-values less than the significance threshold across the 
different models of demographic history (Fig. 7). The expected 




Top SNPs 

Figure 6. Cumulative distribution of the amount of the 
phenotypic variance of a trait (Vp) /-axis) explained by the 
SNPs that explain the most variance U-axis). (A) A SNP's effect on 
the trait is correlated with its effect on fitness (t = 0.5). Note that the top 
SNPs that account for the most variance explain less of it in the 
population that experienced recent growth (green; BN+growth) than in 
the populations that did not recently expand (orange and purple; BN 
and Old growth). (B) A SNP's effect on the trait is independent of its 
effect on fitness (x = 0). Here /;^ = 0.05 and /W = 70 kb. 
doi:10.l371/journal.pgen.l004379.g006 

number of causal SNPs detected in a study of a given sample size 
will account for both the power to detect a given variant as well as 
the number, frequency distribution, and effect size distribution of 
causal variants in the population. It also directly answers the 
relevant question for researchers who are planning and interpret- 
ing an association study: Under a given model of genetic 
architecture, with a given sample size, how many significant 
associations would I expect to detect? 

Under the model where a mutation's effect on fitness is 
correlated with its effect on the trait (x = 0.5), /z|. = 0.3, and 
M— 70 kb, fewer causal mutations are detected in the populations 
that have undergone ancient (Old growth; purple box in Fig. 7A) 
or recent (BN+growth; green box in Fig. 7A) expansions, relative 
to the population that only underwent a recent bottleneck (BN; 
orange box in Fig. 7A). Similar trends are seen for the other 
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Figure 7. The number of causal SNPs with a significant /'-value 
«1 10 5 ) in the single-marker association test for different 
models of population history. (A) A SNP's effect on the trait is 
correlated with its effect on fitness (t = 0.5). (B) A SNP's effect on the 
trait is independent of its effect on fitness (x = 0). Here h 2 c = 03 and 
/M = 70kb. 

doi:1 0.1 371 /journal.pgen.1 004379.g007 

models of }r c and AI (Fig. S4). However, when /j^. = 0.05, sample 
sizes of 1 000 cases and controls are too small to detect almost any 
associations with P< 1x10 , regardless of the demography of the 
population. When using a less stringent significance threshold (P< 
0.01), /i^ = 0.3, and M= 70 kb, a median of 10 causal loci are 
associated with the trait in the BN population (Fig. S8A). 
However, a median of only 8 causal loci are detected in the BN+ 
growth population. Again, similar trends are noted for the other 
models of h 2 c and M (Fig. S8). However, when /!^ = 0.05, a 
median of 2 causal SNPs are detected at P<0.01 for all three 
demographic models. This result is due to the very low power to 
detect an association for causal variants with very small effect sizes 
using samples of 1000 cases and 1000 controls, regardless of the 
demography of the population. Nevertheless, even here, a higher 
proportion of simulation replicates have detected at least 3 
associations in the BN population (54%) than in the BN+growth 
population (41%). Taken together, this analysis suggests that 
recent population growth can result in a decrease in the expected 
number of associations to be detected in a given sample size. Thus, 
while recent growth may increase power to detect the SNP that 
explains the greatest amount of the variance, and have little effect 
on power to detect a given SNP conditional on its frequency or 
effect size, it enriches the frequency distribution for rare causal 
variants. The power to detect such variants using single-marker 
association tests is low, decreasing the expected number of 



significant associations to be detected in the population that 
recently expanded. 

However, demographic history has no clear effect on the 
number of causal loci detected with a given sample size when the 
mutation's effect on fitness is independent of its effect on the trait 
(t = 0; Fig. 7B, Fig. S4D-F, and Fig. S8E-H). For some models, 
the BN+growth population appears to have a higher number of 
significant associations than in the BN population (Fig. S4D and 
Fig. S4E). However, this pattern is not consistently seen across 
models or significance thresholds. Similarly, when using a 
significance threshold of P<0.01, the Old growth population 
appears to show a greater number of significant association (Fig. 
S8) than either of the other two models of population history. This 
pattern may be due to the slight, but noticeable, increase in the 
h 2 c for the Old growth population (Fig. SI). 

Researchers have suggested that the amount of the additive 
genetic variance {Va) explained by a set of SNPs increases when 
the stringency of the P-value threshold for including SNPs in the 
set is decreased [74-76] . Fig. 8 shows the amount of Va explained 
by SNPs having single-marker P-values less than the threshold 
specified on the x-axis for the model where h 2 c = 0.05and 
M= 70 kb. When a SNP's effect on the trait is correlated with 
its effect on fitness (t = 0.5), population history has an important 
effect on the amount of Va accounted for by SNPs with P-value less 
than a given threshold (Fig. 8A). Specifically, recent population 
growth decreases the amount of V A accounted for by SNPs at all P- 
value thresholds, relative to what is seen in a population that has 
not expanded. For example, SNPs having P<0.05 account for 
about 30% of the V A in the BN population. SNPs with P<0.05 
account for only 20% of Va in the BN+growth population. 
Including all SNPs detected in the case-control study captures only 
70% of the V A contained within the mutational target in the BN+ 
growth population. The reason for this is that many of the rare 
variants that account for the Va of the trait in the population are 
not present in the sample of 1000 cases and 1000 controls. When 
T = 0, population history has little affect on the amount of Va 
accounted for by SNPs with a given P-value (Fig. 8B). Including 
all SNPs present in the association study captures over 95% of the 
Va contained within the mutational target. This finding is not 
surprising in light of the observation (Fig. 4B) that much of the Va 
is accounted for by common variants when T = 0, and such 
variants are likely to be present in the sample of 1000 cases and 
1000 controls. Qualitatively similar trends are seen for other 
heritablities and mutational target sizes (Fig. S9). 

Discussion 

I have shown that very recent population growth can have an 
impact on patterns of deleterious genetic variation and the genetic 
architecture of complex traits. Specifically, I show that recent 
population growth leads to an increase in the proportion of 
nonsynonymous SNPs relative to that expected in non-expanded 
populations. Further, this recent growth is predicted to have 
affected the genetic architecture of some complex traits. This result 
has implications for discovering the "missing heritability" in 
different human populations and detecting causal variants that 
may also affect reproductive fitness. 

While I have shown that demographic history greatly affects the 
proportion and frequencies of deleterious mutations segregating in 
the population, it is interesting that demography does not have a 
large effect on the overall genetic load of the population. Haldane 
has shown that the genetic load at equilibrium contributed by a 
particular mutation is independent of the strength of selection 
acting on the particular mutation and the frequency of that 
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Figure 8. Cumulative distribution of the amount of the additive 
genetic variance of a trait (V A ; /-axis) explained by SNPs with a 
single-marker association test P-value less than a given 
threshold (x-axis). (A) A SNP's effect on the trait is correlated with 
its effect on fitness (x = 0.5). Note that the population that experienced 
recent growth (green line; BN+growth) has a lower proportion of V A 
accounted for by SNPs at any P-value threshold than the populations 
that did not recently expand (orange and purple lines; BN and Old 
growth). (B) A SNP's effect on the trait is independent of its effect on 
fitness (t = 0). Here note that the SNPs with low P-values (<0.05) 
account for most of the V A regardless of the demographic history of the 
population. Here /;^ = 0.05 and M = 70 kb. 
doi:10.1371/journal.pgen.1004379.g008 



mutation [66]. Strongly deleterious mutations will have a large 
effect on fitness, but will be kept at lower frequency by negative 
selection than mutations of weaker effect. Weakly deleterious 
mutations will not have as large of an effect, but they can be at 
higher frequency in the population. Haldane suggested that these 
effects should cancel each other out. Haldane's result was derived 
for a simple model with a constant population size. It was unclear 
whether this result would hold when considering populations with 
bottlenecks and recent growth, though previous theoretical results 
suggested that smaller populations may be expected to have a 
higher load [77]. Here I have shown that Haldane's result applies 
under certain complex demographic models. Further work is 



required to determine whether this trend holds in other species 
and demographic models, and whether this trend holds for models 
involving dominance. 

It is also important to appreciate that while my analysis as well 
as that of Simons et al. [38] show that recent population history 
has not affected the overall genetic load of the population, these 
results should not be taken to imply that population history has 
had no effect on patterns of deleterious variation. First, the way in 
which the populations arrive at their load differs across popula- 
tions with different histories. Populations that have recendy 
expanded have more of the load accounted for by rare variants 
than populations that have not recently expanded. Further, 
genetic load is one specific feature of deleterious genetic variation. 
Other statistics, like the proportion of nonsynonymous SNPs, are 
very sensitive to differences in population history. While it has 
been shown that differences in population history between 
European and African populations have affected the proportion 
of nonsynonymous SNPs in the two populations [54], here I 
demonstrate that the influence of population history on the 
proportion of nonsynonymous SNPs is predicted to apply on a 
much more recent timescale, and to populations that are much 
more similar to each other than Europeans and Africans. In sum, 
taken together, these results suggest that population history affects 
patterns of deleterious genetic variation found in different 
populations. 

I find that population history is predicted to have little effect on 
the overall amount of additive genetic variance for a trait seen in 
different populations. As such, assuming a common environmental 
variance across populations, the heritability of a trait is predicted 
to be similar across populations. This finding suggests that if 
differences in the heritability of a trait are detected across 
populations, these differences are more likely to be due to differing 
environmental effects, rather than due to different amounts of 
additive genetic variance. For example, it has been suggested that 
the heritability of height in a West African population is less than 
that typically estimated from European populations [78]. My 
results would argue that such a difference would be due to shifts in 
the environmental variance, rather than changes in amount of 
additive genetic variance as a result of differences in recent 
population history. 

A major conclusion of this study is that recent population 
growth has a greater effect on the architecture of traits when a 
mutation's effect on fitness is correlated with its effect on the 
phenotype than when the mutation's effect on fitness is indepen- 
dent of its effect on the phenotype. The extent to which mutations 
that increase risk of disease are under negative selection remains 
unclear. While it is intuitive that disease mutations should be 
under negative selection, most common diseases have an onset 
after reproductive age, and as such, may not be correlated with 
reproductive fitness. However, it is possible that mutations 
increasing risk to late-onset disease may have pleiotropic effects 
and could affect traits related to reproduction [59,79]. In fact, it 
has been suggested that pleiotropy is rather frequent for many 
common SNPs associated with disease [80] , and may apply to rare 
variants as well. Additionally, genes and common variants 
associated with common diseases show signatures of negative 
selection [81,82], suggesting that disease variants may be under 
negative selection. A third line of evidence comes from studies of 
model organisms. A mutagenesis study [83] found that P-element 
insertions that contributed the most to the variance in bristle 
number in Drosophila tended to reduce viability. A fourth line of 
evidence suggesting that a mutation's effect on complex disease 
may be correlated with fitness comes from an empirical analysis of 
GWAS data. Looking at over 350 susceptibility SNPs across eight 
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categories of phenotypes, Park et al. [84] found that low-frequency 
SNPs tended to have larger effect sizes than more common SNPs 
(significantly so for type 1 diabetes, height, and LDL levels), even 
after correcting for the ascertainment bias resulting from the 
reduced power to detect associations with low-frequency SNPs of 
weak effect. Such a correlation is expected under models where a 
mutation's effect on fitness is correlated with its effect on the trait 
(t = 0.5; Fig. S10A). However, a correlation between a mutation's 
effect on disease and its frequency is not expected under a model 
where a mutation's effect on disease is independent of its effect on 
fitness (x = 0; Fig. S10B). There is little direct evidence to indicate 
whether this result holds for low-frequency variants in coding 
regions. But, selection is likely to be stronger (per base pair) in 
coding regions than throughout noncoding regions of the genome 
[85] , suggesting this result should hold for coding regions as well. 
Finally, the correlation between a mutation's effect on fitness and 
its effect on a trait is likely to depend on the particular trait 
involved. While further empirical and theoretical work is needed 
in this area, all of these lines of evidence suggest that, for some 
traits, it is plausible that a mutation's effect on fitness could indeed 
be correlated with its effect on disease. 

Further rationale for considering models where a mutation's 
effect on fitness is correlated with its effect on the trait comes from 
exome sequencing studies themselves. A major assumption made 
in exome sequencing studies is that some of the missing heritability 
can be explained by rare variants of large effect that increase risk 
to disease [3,4] . If there is no correlation between a mutation's 
effect on disease and its effect on fitness, then there is no reason for 
the effects on disease risk to be greater for rare variants than for 
more common variants. Under this model (where fitness effects are 
independent of trait effects), effect sizes would be randomly 
assigned to SNPs, regardless of their allele frequency. On the other 
hand, if a mutation's effect on fitness is correlated with its effect on 
disease, then the SNPs with the strongest effects on disease are 
likely to be the most deleterious ones. As such, they will also be the 
most rare in the population due to negative selection. Because the 
exome sequencing paradigm essentially assumes that the effect of a 
coding region mutation on disease is correlated with its effect on 
fitness, it is important to investigate the proprieties of such a model 
under different population histories. 

My models make several predictions that can be tested with 
empirical data. While these models have been developed to apply 
to exome sequencing data, because the predictions are robust to 
the mutational target size and heritability accounted for by the 
mutations in the target region (Fig. SI, Fig. S4, Fig. S6-Fig. 
S9), they should apply to GWAS data as well, especially if low- 
frequency variants are imputed from a reference panel, like the 
1000 Genomes Project. First, the models predict that if a 
mutation's effect on fitness is correlated with its effect on the 
trait, common variants should account for more of the heritability 
in a population that did not expand than in one that had recently 
expanded. This prediction can be tested by analyzing GWAS data 
in expanded vs. non-expanded populations. Second, for a given 
sample size, if a mutation's effect on fitness is correlated with its 
effect on the trait, the models predict that fewer significant 
associations will be detected in the recently expanded population 
than in a population that has not expanded. This prediction can 
also be tested by comparing the number of significant associations 
detected in GWAS data from the expanded population vs. the 
non-expanded population. Third, the prediction that, if a 
mutation's effect on fitness is correlated with its effect on the 
trait, low-frequency variants should account for more of the 
heritability in the recently expanded population than in a non- 
expanded population can be tested directly once large-scale exome 



sequencing data in both expanded and non-expanded populations 
has been collected. Failing to find these patterns in GWAS and 
exome sequencing data would suggest that there is little correlation 
between a mutation's effect on fitness and its effect on the trait. 

Several recent studies have used results from population genetic 
models to guide the design and interpretation of association studies 
of rare variants [86,87]. My results are especially complementary 
to those Zuk et al. [87]. In particular, they argue that a population 
expansion does not increase the proportion of the disease due to 
new alleles. My finding of a similar contribution of young alleles to 
the heritability in expanded vs. non-expanded populations 
(Fig. 3A) supports their conclusion, despite different modeling 
assumptions in the two studies. Zuk et al. [87] also argue that the 
"role" of rare variants in disease is increased in an expanding 
population. Here I provide a more detailed analysis of this topic 
using an explicit quantitative genetic model and evaluate the 
conditions under which recent population growth accentuates the 
contribution of rare variants to the heritability of the trait. 

One important limitation of the present study is that I evaluated 
the power of single-marker association tests, rather than gene- 
based association tests. It has been suggested that single-marker 
tests may be under-powered relative to gene-based tests for 
detecting associations with rare variants [37,88]. I did not consider 
gene-based association tests in the present study because the 
present simulations assume that all SNPs are independent of each 
other. Thus, there is no way to simulate genes containing multiple 
SNPs having the appropriate LD structure in this framework. 
Future work could simulate larger genie regions using other 
approaches [56]. However, the analysis of the performance of 
single-marker association tests still provides important insights for 
understanding how demography affects the ability to map genes 
for complex traits. First, many association studies of rare variants 
include single-marker association tests, even if they also consider 
gene-based tests [23-25,88]. Thus, my results are directly 
applicable to designing and interpreting such studies. Second, 
several gene-based association tests combine the single-marker 
association tests in various ways. Thus, the manner in which 
single-marker signals are affected by demography is relevant for 
such tests. Third, it is not clear that gene-based tests are always 
superior to single-marker tests. A recent study [89] has suggested 
that gene-based tests based on single-marker association statistics 
may be more powerful than other "burden tests." Further, 
performance of gene-based association tests is known to decrease if 
many non-causal SNPs are included [22,87,89]. Also, if causal 
variants are scattered across many distinct genes, then gene-based 
association tests may not provide an increase in power over single- 
marker tests [30] . As sample sizes continue to grow, it is likely that 
single-marker tests will be more frequentiy used for sequencing- 
based association studies because they are simpler to implement 
and eliminate the need to decide how to combine variants within a 
gene. Thus, insights gained from the analysis of single-marker 
association tests in the present study should still be useful. 

Another possible limitation of this study is that my models do 
not allow some mutations affecting complex traits to be beneficial 
or under balancing selection. There is some evidence that loci 
associated with certain traits (obesity and type 2 diabetes, in 
particular [90,91]) may have been affected by positive selection. 
However, this pattern was found for common variants outside of 
coding regions. It is not clear how many nonsynonymous causal 
mutations would be expected to be under positive selection. 
Additionally, loci under positive or balancing selection should have 
already been detected in GWAS (because they would be common 
variants), and are not the type of loci researchers are aiming to 
discover through exome sequencing studies. 
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My results suggest that if a mutation's effect on fitness is 
correlated with its effect on the trait, recent population history can 
have an important effect on the ability to detect associations with 
causal variants. In the simulations, fewer causal SNPs were 
significandy associated with disease in the population that 
experienced recent growth as compared to the population that 
did not expand. This result would imply that, in order to detect the 
greatest number of causal loci for a given sample size, it would be 
better to focus on populations that experienced a bottleneck, but 
did not experience a recent population expansion. Currendy, it is 
not clear which populations meet this criterion. Further sequenc- 
ing of large samples of individuals will be required to determine 
which populations have not experienced recent population 
growth. For a variety of reasons, most GWAS have been done 
in large samples of cases and controls of European ancestry [92] . 
The same trend may hold for exome and genome re-sequencing 
studies. Recent genetic studies, as well as historical records indicate 
that many European populations are precisely those that 
experienced the type of extreme, recent population growth 
simulated in this study [36]. The simulations presented here 
suggest that focusing on such populations may not discover the 
largest number of causal variants for a given number of individuals 
sequenced. 

An important goal in human genetics is to understand disease 
risk in populations throughout the globe. While focusing on 
populations that have not experienced ancient or recent growth 
may yield the largest list of putative causal loci, it is currently not 
clear whether such an approach will lead to increased under- 
standing of the genetic basis of disease in all populations — not just 
the population under study. Analyses of common variants 
implicated through GWAS suggest that loci associated with traits 
in European populations also affect the traits in other non- 
European populations [93-96] . However, it is unclear whether this 
trend also holds for rare variants that have arisen after the 
populations have split from each other. Thus, in order to 
understand the genetic basis of complex traits across the globe, 
it will be important to study populations that have recendy 
expanded in addition to those more stable in size, recognizing that 
larger sample sizes in populations that have recendy expanded will 
be necessary to achieve comparable power to that in non- 
expanded populations. 

Finally, these results are directly relevant for finding the 
"missing heritability" in different populations. If a mutation's 
effect on disease is correlated with its effect on fitness, then more of 
the heritability will be explained by very rare variants in a 
population that experienced a recent expansion than in a 
population that did not recently expand. Additionally, the variants 
detected by single-marker association tests explain less of the 
heritability in a recently expanded population than in a population 
that did not recently expand. Thus, while the overall heritability of 
a trait may not be variable across populations, our ability to 
discover the variants that account for it is likely to vary across 
populations due to differences in demographic history. 

Supporting Information 

Figure SI Population history has little effect on the narrow-sense 
heritability (h 2 ) of a trait. (A-D) A SNP's effect on the trait is 
correlated with its effect on fitness (x = 0.5). (E-H) A SNP's effect 
on the trait is independent of its effect on fitness (x = 0). (A, E) 
h 2 c = Q3 and Af= 70 kb. (B, F) h 2 c = 03 and M= 140 kb. (C, G) 
h 2 c = 0A and M=70kb. (D, H) /£ = 0.05and M= 70 kb. 
Narrow sense heritability was computed for each demographic 

model as h 2 = — — ^— - . Here VE = \—h 2 r for all scenarios. 
Va + Ve 



Va= J2 2/7,(1 —pi)oif, p, is the frequency of the t h SNP, and ot ; 

i SNPs 

is the i h SNP's effect on the trait. 
(PDF) 

Figure S2 Proportion of nonsynonymous SNPs over time. Note 
that the proportion of nonsynonymous SNPs after the bottleneck 
(near 0 generations) is higher than that in the ancestral population 
(at time 4000 generations ago) for both the model with recent 
population growth (dashed green line) and the model without 
recent population growth (solid orange line). 
(PDF) 

Figure S3 Population history has little effect on the genetic load. 
Genetic load was calculated for all SNPs segregating in a sample of 
6000 individuals taken from each demographic history. 
(PDF) 

Figure S4 The number of causal SNPs with a significant P- value 
(< 1 x 1 0 5 ) in the single-marker association test for additional 
models of the trait. Orange denotes the bottleneck demographic 
model (BN). Green denotes the bottleneck and recent growth 
model (BN+growth). Purple denotes the ancient growth model 
(Old growth). (A-C) A SNP's effect on the trait is correlated with 
its effect on fitness (x = 0.5). (D-F) A SNP's effect on the trait is 
independent of its effect on fitness (x = 0). (A, D) h^, = 0.3 and 
M= 140 kb. (B, E) h 2 c =0.1 and M= 70 kb. (C, F) h 2 c =0.05 
and M=70 kb. 
(PDF) 

Figure S5 Effect of recent population growth on the heritability 
attributable to mutations of different ages when x = 0. Orange 
boxes denote the bottlenecked population that did not recendy 
expand (BN). Green boxes denote a population that expanded 80 
generations ago (BN+growth). "Before botdeneck" refers to 
mutations that arose more than 1960 generations ago (before or 
during the bottleneck). "After bottleneck" refers to mutations that 
arose after the population recovered from the bottleneck, but 
earlier than 80 generations ago. "After growth" refers to mutations 
that arose within the last 80 generations (after the population 
expanded). (A) Heritability attributed to mutations of different 
ages. Note that recent population growth does not affect the 
median heritability attributable to mutations of different ages. (B) 
Number of SNPs segregating in the present-day that arose during 
the different time intervals. (C) Mean allele frequency of SNPs that 
are segregating in the present-day that arose during the different 
time intervals. (D) Mean effect size of SNPs that are segregating in 
the present-day that arose during the different time intervals. Here 
h 2 c = 0.05 andM=70 kb. 
(PDF) 

Figure S6 Cumulative distribution of the amount of the additive 
genetic variance of a trait (Va, jy-axis) explained by SNPs 
segregating below a given frequency in the population (x-axis) 
for additional models of the trait. (A-C) A SNP's effect on the trait 
is correlated with its effect on fitness (x = 0.5). Note that the 
population that experienced recent growth (green; BN+growth) 
has a higher proportion of Va accounted for by low-frequency 
SNPs (<0.1% frequency) than the populations that did not 
recently expand (orange and purple; BN and Old growth). (D-F) 
A SNP's effect on the trait is independent of its effect on fitness 
(x = 0). Note that less of Va is accounted for by low-frequency 
variants than when the trait is correlated with fitness (A). 
(PDF) 

Figure S7 Cumulative distribution of the amount of the 
phenotypic variance of a trait (Vp, y-axis) explained by the SNPs 
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that explain the most variance (x-axis) for additional models of the 
trait. (A-C) A SNP's effect on the trait is correlated with its effect 
on fitness (t = 0.5). Note that the top SNPs account for less of the 
phenotypic variance in the population that experienced recent 
growth (green; BN+growth) than in the populations that did not 
recently expand (orange and purple; BN and Old growth). (D— F) 
A SNP's effect on the trait is independent of its effect on fitness 

(T = 0). 

(PDF) 

Figure S8 The number of causal SNPs with a P-value < 1 x 1 0 ~ 2 
in the single-marker association test for different models of 
population history and the trait. Orange denotes the bottleneck 
demographic model (BN). Green denotes the botdeneck and 
recent growth model (BN+growth). Purple denotes the ancient 
growth model (Old growth). (A— D) A SNP's effect on the trait is 
correlated with its effect on fitness (t = 0.5). (E-H) A SNP's effect 
on the trait is independent of its effect on fitness (x = 0). (A, E) 
h 2 c = 0.3 and M= 70 kb. (B, F) /£ = 0.3 and M= 140 kb. (C, G) 
A 2 , = 0.1 and M= 70 kb. (D, H) h 2 c = 0.05 and M= 70 kb. 
(PDF) 

Figure S9 Cumulative distribution of the amount of the additive 
genetic variance of a trait ( Va, jy-axis) explained by SNPs with a 
single-marker association test P-value less than a given threshold 
(x-axis) for additional models of the trait. (A-C) A SNP's effect on 
the trait is correlated with its effect on fitness (x = 0.5). Note that 
the population that experienced recent growth (green line; BN+ 
growth) has a lower proportion of Va accounted for by SNPs at 
any P-value threshold than the populations that did not recendy 
expand (orange and purple lines; BN and Old growth). (D— F) A 
SNP's effect on the trait is independent of its effect on fitness 
(x = 0). Note that the SNPs with low P-values (<0.05) account for 
most of the Va regardless of the demographic history of the 
population. (A, D) A 2 , = 0.3 and M=70 kb. (B, E) A 2 , = 0.3 and 
M= 140 kb. (C, F) h 2 c = 0.l and M= 70 kb. 
(PDF) 

Figure S10 The relationship between a mutation's effect on the 
trait and its allele frequency. Statistics were calculated for each 
simulation replicate and then averaged over the 1000 simulation 
replicates. (A, C) A SNP's effect on the trait is correlated with its 
effect on fitness (x = 0.5). (B, D) A SNP's effect on the trait is 
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